Introduction

The data in this report were collected from two secondary schools and from two classes, a math class and a Portuguese language class. The factors presented in the data set are social factors and should support an in-depth analysis of teen alcohol consumption.

Problem Statement

With this information we will be able to determine whether alcohol consumption is an issue for the students in these secondary schools, and how social factors relate to alcohol consumption. We assume the questions below will help identify relationships between alcohol consumption and student performance, free time, interest in attending a university, family cohabitation status, and health status. Finding such relationships will allow us to predict the likelihood of a student consuming alcohol. We will also compare key differences in alcohol consumption between the math and language classes.

# Please install packages if not in library  
library(ggplot2)
library(plyr)
library(grid)
library(gridExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(skimr)
library(tidyr)
# Read student-mat.csv and student-por.csv into separate data frames
mat_students = read.csv("student-mat.csv")
por_students = read.csv("student-por.csv")

# Merge math and Portuguese class together into one data-frame
mat_por_students = rbind(mat_students, por_students)
# Check for missing data and omit any rows that contain it
if (sum(is.na(mat_por_students)) != 0) {
  mat_por_students <- na.omit(mat_por_students)
}
sum(is.na(mat_por_students))
## [1] 0
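
The problem statement plans to compare the math and Portuguese classes, but rbind() keeps no record of which file a row came from. The following is a minimal sketch of one way to tag rows before stacking them. It uses two tiny invented data frames rather than the real files, and the column name `class` is a hypothetical choice, not part of the original data.

```r
# Toy stand-ins for the two files (values are invented for illustration)
mat_toy <- data.frame(age = c(15, 16), weekday.alcohol.consumption = c(1, 2))
por_toy <- data.frame(age = c(17, 18, 16), weekday.alcohol.consumption = c(1, 1, 3))

# Tag each row with its source class before stacking ("class" is a hypothetical name)
mat_toy$class <- "math"
por_toy$class <- "portuguese"
both_toy <- rbind(mat_toy, por_toy)

# Rows per class and mean weekday consumption level per class
class_counts <- table(both_toy$class)
class_means <- tapply(both_toy$weekday.alcohol.consumption, both_toy$class, mean)
print(class_counts)
print(class_means)
```

With such a column in place, later tables and plots could be split by class in the same way they are split by gender.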

Data Exploration

We will use the dim function to get the dimensions of the data frame.
dim(mat_por_students) # dim gives dimensions of object 
## [1] 1044   33

The data set contains 33 variables and 1044 observations.

Structure and summary of the data-frame for both classes
summary(mat_por_students)
##  school   gender       age        address family.size
##  GP:772   F:591   Min.   :15.00   R:285   GT3:738    
##  MS:272   M:453   1st Qu.:16.00   U:759   LE3:306    
##                   Median :17.00                      
##                   Mean   :16.73                      
##                   3rd Qu.:18.00                      
##                   Max.   :22.00                      
##  parents.cohabitation.status mothers.education fathers.education   mothers.job 
##  A:121                       Min.   :0.000     Min.   :0.000     at_home :194  
##  T:923                       1st Qu.:2.000     1st Qu.:1.000     health  : 82  
##                              Median :3.000     Median :2.000     other   :399  
##                              Mean   :2.603     Mean   :2.388     services:239  
##                              3rd Qu.:4.000     3rd Qu.:3.000     teacher :130  
##                              Max.   :4.000     Max.   :4.000                   
##    fathers.job  reason.for.school.selection   guardian     traveltime   
##  at_home : 62   course    :430              father:243   Min.   :1.000  
##  health  : 41   home      :258              mother:728   1st Qu.:1.000  
##  other   :584   other     :108              other : 73   Median :1.000  
##  services:292   reputation:248                           Mean   :1.523  
##  teacher : 65                                            3rd Qu.:2.000  
##                                                          Max.   :4.000  
##    studytime    past.class.failures extra.school.support
##  Min.   :1.00   Min.   :0.0000      no :925             
##  1st Qu.:1.00   1st Qu.:0.0000      yes:119             
##  Median :2.00   Median :0.0000                          
##  Mean   :1.97   Mean   :0.2644                          
##  3rd Qu.:2.00   3rd Qu.:0.0000                          
##  Max.   :4.00   Max.   :3.0000                          
##  family.education.support extra.paid.classes extra.curricular.activities
##  no :404                  no :824            no :528                    
##  yes:640                  yes:220            yes:516                    
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  attended.nursery.school higher.education internet.access romantic.relationship
##  no :209                 no : 89          no :217         no :673              
##  yes:835                 yes:955          yes:827         yes:371              
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  quality.of.family.relationships    freetime     going.out.with.friends
##  Min.   :1.000                   Min.   :1.000   Min.   :1.000         
##  1st Qu.:4.000                   1st Qu.:3.000   1st Qu.:2.000         
##  Median :4.000                   Median :3.000   Median :3.000         
##  Mean   :3.936                   Mean   :3.201   Mean   :3.156         
##  3rd Qu.:5.000                   3rd Qu.:4.000   3rd Qu.:4.000         
##  Max.   :5.000                   Max.   :5.000   Max.   :5.000         
##  weekday.alcohol.consumption weekend.alcohol.consumption health.status  
##  Min.   :1.000               Min.   :1.000               Min.   :1.000  
##  1st Qu.:1.000               1st Qu.:1.000               1st Qu.:3.000  
##  Median :1.000               Median :2.000               Median :4.000  
##  Mean   :1.494               Mean   :2.284               Mean   :3.543  
##  3rd Qu.:2.000               3rd Qu.:3.000               3rd Qu.:5.000  
##  Max.   :5.000               Max.   :5.000               Max.   :5.000  
##  school.absences  first.period.grade second.period.grade  final.grade   
##  Min.   : 0.000   Min.   : 0.00      Min.   : 0.00       Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.: 9.00      1st Qu.: 9.00       1st Qu.:10.00  
##  Median : 2.000   Median :11.00      Median :11.00       Median :11.00  
##  Mean   : 4.435   Mean   :11.21      Mean   :11.25       Mean   :11.34  
##  3rd Qu.: 6.000   3rd Qu.:13.00      3rd Qu.:13.00       3rd Qu.:14.00  
##  Max.   :75.000   Max.   :19.00      Max.   :19.00       Max.   :20.00
skim(mat_por_students) # Skim is an alternative to summary, which provides a broad view of the dataframe.
Data summary
Name mat_por_students
Number of rows 1044
Number of columns 33
_______________________
Column type frequency:
factor 17
numeric 16
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
school 0 1 FALSE 2 GP: 772, MS: 272
gender 0 1 FALSE 2 F: 591, M: 453
address 0 1 FALSE 2 U: 759, R: 285
family.size 0 1 FALSE 2 GT3: 738, LE3: 306
parents.cohabitation.status 0 1 FALSE 2 T: 923, A: 121
mothers.job 0 1 FALSE 5 oth: 399, ser: 239, at_: 194, tea: 130
fathers.job 0 1 FALSE 5 oth: 584, ser: 292, tea: 65, at_: 62
reason.for.school.selection 0 1 FALSE 4 cou: 430, hom: 258, rep: 248, oth: 108
guardian 0 1 FALSE 3 mot: 728, fat: 243, oth: 73
extra.school.support 0 1 FALSE 2 no: 925, yes: 119
family.education.support 0 1 FALSE 2 yes: 640, no: 404
extra.paid.classes 0 1 FALSE 2 no: 824, yes: 220
extra.curricular.activities 0 1 FALSE 2 no: 528, yes: 516
attended.nursery.school 0 1 FALSE 2 yes: 835, no: 209
higher.education 0 1 FALSE 2 yes: 955, no: 89
internet.access 0 1 FALSE 2 yes: 827, no: 217
romantic.relationship 0 1 FALSE 2 no: 673, yes: 371

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 16.73 1.24 15 16 17 18 22 ▇▅▅▁▁
mothers.education 0 1 2.60 1.12 0 2 3 4 4 ▁▅▇▆▇
fathers.education 0 1 2.39 1.10 0 1 2 3 4 ▁▆▇▆▆
traveltime 0 1 1.52 0.73 1 1 1 2 4 ▇▅▁▁▁
studytime 0 1 1.97 0.83 1 1 2 2 4 ▅▇▁▂▁
past.class.failures 0 1 0.26 0.66 0 0 0 0 3 ▇▁▁▁▁
quality.of.family.relationships 0 1 3.94 0.93 1 4 4 5 5 ▁▁▂▇▅
freetime 0 1 3.20 1.03 1 3 3 4 5 ▁▃▇▆▂
going.out.with.friends 0 1 3.16 1.15 1 2 3 4 5 ▂▆▇▆▃
weekday.alcohol.consumption 0 1 1.49 0.91 1 1 1 2 5 ▇▂▁▁▁
weekend.alcohol.consumption 0 1 2.28 1.29 1 1 2 3 5 ▇▅▅▃▂
health.status 0 1 3.54 1.42 1 3 4 5 5 ▃▂▅▃▇
school.absences 0 1 4.43 6.21 0 0 2 6 75 ▇▁▁▁▁
first.period.grade 0 1 11.21 2.98 0 9 11 13 19 ▁▂▇▇▂
second.period.grade 0 1 11.25 3.29 0 9 11 13 19 ▁▂▇▇▂
final.grade 0 1 11.34 3.86 0 10 11 14 20 ▁▂▇▆▁
str(mat_por_students)
## 'data.frame':    1044 obs. of  33 variables:
##  $ school                         : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender                         : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age                            : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address                        : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ family.size                    : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ parents.cohabitation.status    : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ mothers.education              : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ fathers.education              : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ mothers.job                    : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ fathers.job                    : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason.for.school.selection    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian                       : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime                     : int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime                      : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ past.class.failures            : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ extra.school.support           : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ family.education.support       : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ extra.paid.classes             : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ extra.curricular.activities    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ attended.nursery.school        : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher.education               : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet.access                : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic.relationship          : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ quality.of.family.relationships: int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime                       : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ going.out.with.friends         : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ weekday.alcohol.consumption    : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ weekend.alcohol.consumption    : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health.status                  : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ school.absences                : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ first.period.grade             : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ second.period.grade            : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ final.grade                    : int  6 6 10 15 10 15 11 6 19 15 ...

Observations:

  1. The dataset is made up of data from two Portuguese schools, Gabriel Pereira HS and Mousinho da Silveira HS, and it measures student performance in math and Portuguese classes.
  2. The dataset contains detailed observations for each student on social variables including age, gender, grade performance, free time, family relationship status, alcohol consumption, and the parents' employment status, among others.
  3. The data consists of 1044 observations and 33 variables. 17 of the variables are factors (categorical values) and 16 are integer (numerical) values.

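We will look at the mean, median, minimum, and maximum of the alcohol consumption levels reported for the school week and the weekend. Both variables are recorded on a scale from 1 (very low) to 5 (very high), not as a count of beverages.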

mean(mat_por_students$weekday.alcohol.consumption)
## [1] 1.494253
mean(mat_por_students$weekend.alcohol.consumption)
## [1] 2.284483
summary(mat_por_students$weekday.alcohol.consumption)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.494   2.000   5.000
summary(mat_por_students$weekend.alcohol.consumption)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.284   3.000   5.000

Observations

  1. The mean weekday alcohol consumption level is 1.49, while the mean weekend level is 2.28. Students therefore report drinking more on weekends than on weekdays.

Head and Tail Functions

Using the “head” and “tail” functions gives insight into the first six rows of data (head) and the last six rows of data (tail).

head(mat_por_students)
tail(mat_por_students)

Observations

Head- The first 6 rows of data consist mostly of female students (5 of 6), and 5 of the 6 have parents who live together. 3 of the mothers stay at home and all of the fathers work. Only 2 of the students have supplemental school support, and all report a study time level of 2 or higher. Additionally, all 6 students want higher education, and all report low alcohol consumption (levels 1 to 3).

Tail- The last 6 rows of data include 4 females, and all of the students' parents are together. In this case all of the mothers work, while 5 of the fathers work and 1 stays home. The students in this sample have no supplemental school support and report study time levels of 1 to 3. All want higher education, and all report an alcohol consumption level of at least 1, the minimum of the scale.

Exploration of Variables

Distribution of Gender in the Students

We will use the table and ggplot functions to view the gender of students.

table(mat_por_students$gender)
## 
##   F   M 
## 591 453
Histogram of age, filled by gender
ggplot(aes(x = age, fill = gender), data = mat_por_students) + geom_histogram(binwidth = 1) + ggtitle("Age and Gender of Students")

gender <- as.factor(mat_por_students$gender)

We will use a boxplot to compare the age distribution (median and spread) of females and males in the dataset.

plot(gender,mat_por_students$age)

Observations

  1. The median age of both female and male students in this dataset is about 17. The youngest student is 15 and, apart from outliers, the oldest is 21. There is an outlier at age 22 among the male students.
  2. There are 138 more females than males in the dataset (591 versus 453).
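
A boxplot draws the median and flags outliers rather than showing the mean. As a rough sketch of how those values can be read off numerically, the example below applies base R's boxplot.stats() to a small invented vector of ages. The vector `ages_toy` is an assumption for illustration, not the real data.

```r
# Invented ages for illustration only
ages_toy <- c(15, 15, 16, 16, 17, 17, 17, 18, 18, 19, 22)

# boxplot.stats() returns the five numbers a boxplot draws plus any outliers
stats <- boxplot.stats(ages_toy)

median_age <- stats$stats[3]   # middle line of the box
upper_whisker <- stats$stats[5] # largest value that is not an outlier
outliers <- stats$out           # points drawn individually beyond the whiskers

print(median_age)
print(upper_whisker)
print(outliers)
```

Applying the same call to each gender's ages in `mat_por_students` would give the exact medians and outliers behind the plot.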

Weekend and Weekday Alcohol Consumption

We will use table to review the number of students at each alcohol consumption level (1 to 5) on weekdays and on weekends.

table(mat_por_students$weekday.alcohol.consumption)
## 
##   1   2   3   4   5 
## 727 196  69  26  26
prop.table(table(mat_por_students$weekday.alcohol.consumption))
## 
##          1          2          3          4          5 
## 0.69636015 0.18773946 0.06609195 0.02490421 0.02490421

Histogram describing the frequency of weekday alcohol consumption levels among students

hist(mat_por_students$weekday.alcohol.consumption, main="Weekday Alcohol Consumption Level", xlab="Weekday alcohol consumption level", border="blue", col="purple", ylab="Number of students")

table(mat_por_students$weekend.alcohol.consumption)
## 
##   1   2   3   4   5 
## 398 235 200 138  73
prop.table(table(mat_por_students$weekend.alcohol.consumption))
## 
##          1          2          3          4          5 
## 0.38122605 0.22509579 0.19157088 0.13218391 0.06992337
hist(mat_por_students$weekend.alcohol.consumption, main="Weekend Alcohol Consumption Level", xlab="Weekend alcohol consumption level", border="blue", col="yellow", ylab="Number of students")

Observations

On weekdays: 70% of students (727) report the lowest consumption level, 1. 19% report level 2 and 7% report level 3. Approximately 5% of students report level 4 or 5. On weekends: 38% of students (398) report level 1, and we see an increase at level 2 (23%) and level 3 (19%). 20% of the students report level 4 or 5 on weekends.
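
Counts and percentages are easier to keep consistent when they are derived from the same table. The sketch below recomputes the weekday percentages from the counts printed by table() above; the object names are illustrative only.

```r
# Counts copied from the table() output for weekday.alcohol.consumption above
weekday_counts <- c("1" = 727, "2" = 196, "3" = 69, "4" = 26, "5" = 26)

# Percentage of students at each level, rounded to one decimal place
weekday_pct <- round(100 * weekday_counts / sum(weekday_counts), 1)

# Side-by-side view of counts and percentages
weekday_summary <- data.frame(level = names(weekday_counts),
                              students = as.vector(weekday_counts),
                              percent = as.vector(weekday_pct))
print(weekday_summary)
```

The same pattern applies to the weekend counts.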

Weekday and Weekend Alcohol Consumption by Gender

We will look at the difference of alcohol consumption by gender by creating tables and bar charts for categorical variables

table(mat_por_students$weekday.alcohol.consumption,mat_por_students$gender)
##    
##       F   M
##   1 472 255
##   2  91 105
##   3  16  53
##   4   9  17
##   5   3  23
table(mat_por_students$weekend.alcohol.consumption,mat_por_students$gender)
##    
##       F   M
##   1 270 128
##   2 150  85
##   3 116  84
##   4  44  94
##   5  11  62
ggplot(aes(x = weekday.alcohol.consumption, fill = gender), data = mat_por_students) + geom_bar() + ggtitle("Weekday Alcohol Consumption by Gender")

ggplot(aes(x = weekend.alcohol.consumption, fill = gender), data = mat_por_students) + geom_bar() + ggtitle("Weekend Alcohol Consumption by Gender")

Observations:

There are 138 more females than males in the combined data, yet more males than females report a consumption level of 3 or higher on weekdays (93 males versus 28 females). On weekends the number of students at level 3 or higher more than doubles for males (240) and rises about sixfold for females (171).
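
Because the two genders differ in size, raw counts can mislead. A small sketch, using the weekday counts printed above, converts the table to within-gender proportions with prop.table(); the object names are illustrative.

```r
# Counts copied from the weekday table above: rows are levels 1-5, columns are gender
weekday_by_gender <- matrix(c(472, 91, 16, 9, 3,
                              255, 105, 53, 17, 23),
                            nrow = 5,
                            dimnames = list(level = as.character(1:5),
                                            gender = c("F", "M")))

# Column-wise proportions: each gender's column sums to 1
within_gender <- prop.table(weekday_by_gender, margin = 2)

# Share of each gender reporting level 3 or higher on weekdays
share_level3_plus <- colSums(within_gender[3:5, ])
print(round(100 * share_level3_plus, 1))
```

Using `margin = 2` normalises within each gender, so the comparison is not distorted by there being more female students.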

Analyzing family relations of students.

Analyzing family size of students with their age

We will use a pie chart to show the family size of students: GT3 (greater than 3) or LE3 (less than or equal to 3).

table(mat_por_students$age,mat_por_students$family.size)
##     
##      GT3 LE3
##   15 139  55
##   16 201  80
##   17 200  77
##   18 143  79
##   19  44  12
##   20   8   1
##   21   1   2
##   22   2   0
familytable <- table(mat_por_students$family.size)
lbls <- paste(names(familytable), "\n", familytable, sep="")
pie(familytable, labels = lbls, 
   main="Pie Chart of Family Size")

Observations

About 71% of students (738) have a family size greater than 3, and about 29% (306) have a family size of 3 or less.

Analyzing family size of students with their gender

The stacked bar chart will provide better insight into the gender of students and family size

ggplot(aes(x = family.size), data = mat_por_students) + geom_bar(aes(fill=gender))+ ggtitle("Family Size in Relation to Gender of Students")

Observations:

  1. More females than males have a family size greater than 3.
  2. Most students have a family size greater than 3.
  3. In the 16-18 age group, most students have a family size greater than 3.
  4. The students who have a family size of 3 or less are roughly equally divided between male and female.

Analyzing parents cohabitation status of students.

We will plot the quality of family relationships by gender, faceted by parents' cohabitation status (A = apart, T = together), to look for any relation between the variables.

ggplot(aes(x = quality.of.family.relationships) , data = mat_por_students) + geom_bar(aes(fill=gender)) + facet_wrap(~ parents.cohabitation.status)+ ggtitle("Parents Cohabitation Status and Gender")

Study time vs. parents' cohabitation status

ggplot(aes(x=studytime,fill=parents.cohabitation.status),data= mat_por_students) +geom_bar() +xlab('Weekly Study Time')+ scale_fill_manual(values=c( "#E69F00", "#56B4E9"))+ggtitle("Weekly Study Time and Parents Cohabitation Status")

table(mat_por_students$gender,mat_por_students$studytime)
##    
##       1   2   3   4
##   F 116 311 126  38
##   M 201 192  36  24

Observations:

  1. The majority of students have parents who live together (923 of 1044).
  2. Most students report study time levels 1 to 3, and level 2 is the most common (503 of 1044 students). In this data set study time is coded as 1 (less than 2 hours per week), 2 (2 to 5 hours), 3 (5 to 10 hours), and 4 (more than 10 hours).
  3. After checking the proportions of study time within each cohabitation status, we find that students whose parents live together study more than those whose parents live apart.

Participation in Extra-Curricular Activities

We will tabulate whether students are involved in extra-curricular activities (yes or no) and analyze the relation to students' grades.

table(mat_por_students$extra.curricular.activities,mat_por_students$gender)
##      
##         F   M
##   no  329 199
##   yes 262 254

Observations:

Out of 1044 students, 516 (49%) participate in extra-curricular activities; of these, 51% are female (262) and 49% are male (254).

Analyzing extra-curricular activities with respect to the final grades of students.

ggplot(aes(x = final.grade), data = mat_por_students) + geom_bar(aes(fill=extra.curricular.activities))+ggtitle("Final Grade and Participation in Extra Curricular Activities")

Observations:

From the above graph, it seems that participation in extra-curricular activities does not have a visible effect on students' grades.

Distribution of students who want to pursue higher education
table(mat_por_students$higher.education,mat_por_students$gender)
##      
##         F   M
##   no   39  50
##   yes 552 403

Observations:

It can be observed that out of 1044 students, 955 (about 91%) want to pursue higher education. Within each gender, 552 of 591 females (93%) and 403 of 453 males (89%) want to pursue higher education.

Do address and study time impact a student's decision to pursue higher education?
ggplot(aes(x = studytime), data = mat_por_students) +
  geom_bar(aes(fill = gender)) +
  facet_wrap(higher.education ~ address) + ggtitle("Study Time and Higher Education by Urban/Rural Area")

Observation:

  1. From the above graph it can be observed that students in both urban and rural areas want to pursue higher education.
  2. Female students report more study time than male students.

How do students who want to pursue higher education perform in school?

ggplot(data=mat_por_students,aes(x=higher.education, y=final.grade))+ geom_boxplot() + facet_grid(~gender) + ggtitle("Higher education vs final grade of students")

Observations:

  1. Students who intend to pursue higher education perform better on average than those who do not.

Average number of failures
mean(mat_por_students$past.class.failures)
## [1] 0.2643678

The average number of past class failures among the students is 0.26.

Calculate Total alcohol consumption level of students
# Total alcohol consumption  = combine weekday and weekend alcohol consumption of the students.
mat_por_students$total_alcohol_consumption = rowSums(cbind(mat_por_students$weekday.alcohol.consumption,mat_por_students$weekend.alcohol.consumption))
mean(mat_por_students$total_alcohol_consumption)
## [1] 3.778736

The average total alcohol consumption level of students is 3.78, on a combined scale from 2 to 10.

Is there a relation between failure and alcohol consumption of students?

Covariance between the alcohol consumption and the failure of students

cov(mat_por_students$total_alcohol_consumption,mat_por_students$past.class.failures)
## [1] 0.1601812

Observations:

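The covariance between the two variables is 0.16. The positive sign indicates that consumption and past failures tend to rise together, but covariance depends on the scale of the variables, so its size alone does not show how strong the relationship is; a correlation coefficient is needed for that.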
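
To illustrate why the size of a covariance is hard to interpret, the sketch below uses a small invented data set (not the student data) to show that correlation is covariance divided by both standard deviations, and that rescaling a variable changes the covariance but not the correlation.

```r
# Toy data (invented): x plays the role of a consumption score, y of a failure count
x <- c(2, 3, 4, 5, 6, 8, 10)
y <- c(0, 0, 1, 0, 1, 2, 3)

# Covariance depends on the units of x and y
cov_xy <- cov(x, y)

# Correlation rescales covariance by both standard deviations, so it lies in [-1, 1]
cor_xy <- cov_xy / (sd(x) * sd(y))

# Rescaling x by 10 multiplies the covariance by 10 but leaves the correlation unchanged
cov_scaled <- cov(10 * x, y)
cor_scaled <- cor(10 * x, y)
print(c(cov = cov_xy, cor = cor_xy, cov_scaled = cov_scaled, cor_scaled = cor_scaled))
```

For the student data, calling cor() on `total_alcohol_consumption` and `past.class.failures` would give the scale-free measure of strength.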

Boxplot to analyse the relation between the failure and alcohol consumption.

# Convert the numeric failure count to a labelled factor.
failures = factor(mat_por_students$past.class.failures, labels=c('Never Failed','Failed Once','Failed Twice','Failed Thrice'))
boxplot(total_alcohol_consumption ~ failures, data=mat_por_students)

Does alcohol consumption have any impact on the grades of students?

cor(mat_por_students$final.grade, mat_por_students$weekday.alcohol.consumption)
## [1] -0.1296421
cor(mat_por_students$final.grade, mat_por_students$weekend.alcohol.consumption)
## [1] -0.11574

Observation:

The correlations between alcohol consumption and final grades are negative, which suggests that students who report higher alcohol consumption tend to have lower grades. The relationship is weak, however (about -0.13 on weekdays and -0.12 on weekends), and a correlation alone does not establish cause.
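
Because the consumption variables are ordered levels rather than measured quantities, a rank-based correlation is a reasonable cross-check on the Pearson values. This is a sketch on invented data, using the `method = "spearman"` option of base R's cor(); it is offered as an alternative view, not as the method used in this report.

```r
# Toy data (invented): ordinal consumption level and a grade on a 0-20 scale
level <- c(1, 1, 2, 2, 3, 4, 5, 5)
grade <- c(14, 12, 13, 11, 10, 11, 8, 9)

# Pearson treats the levels as evenly spaced numbers
pearson <- cor(level, grade)

# Spearman uses only the ordering of the values (ties get average ranks)
spearman <- cor(level, grade, method = "spearman")

# Spearman is the Pearson correlation of the ranks
spearman_by_hand <- cor(rank(level), rank(grade))
print(c(pearson = pearson, spearman = spearman))
```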

What is the relationship between the health and class attendance of students?
mat_por_students%>%
  group_by(gender)%>%
  ggplot(aes(x=factor(health.status), y=school.absences, color=gender))+
  geom_smooth(aes(group=gender), method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

Observation:

Female students have more absences on average, and as the health score increases, absences decrease for both male and female students, as expected.

Q1

What are the factors affecting alcohol consumption level?

What is the average total alcohol consumption level?

mean(mat_por_students$total_alcohol_consumption)
## [1] 3.778736

The average total alcohol consumption level of students is 3.78.

Association of family relationships with alcohol consumption level of students

weekday_alcohol_consumption_vs_famrel = ggplot(mat_por_students, aes(x=weekday.alcohol.consumption)) +
  geom_density(aes(color=as.factor(quality.of.family.relationships))) +
  ggtitle("Distribution of Students' Weekday Alcohol Consumption Level by Quality of Family Relationship")

weekend_alcohol_consumption_vs_famrel = ggplot(mat_por_students, aes(x=weekend.alcohol.consumption)) +
  geom_density(aes(color=as.factor(quality.of.family.relationships))) +
  ggtitle("Distribution of Students' Weekend Alcohol Consumption Level by Quality of Family Relationship")

grid.arrange(weekday_alcohol_consumption_vs_famrel, weekend_alcohol_consumption_vs_famrel)

Observations:

It can be observed that students with poorer family relationships consume more alcohol on both weekdays and weekends.

What is the relationship between students' alcohol consumption level, where they live (urban vs. rural), and their performance in school?

# group_by() has no effect on a ggplot call, so the data frame is piped in directly
weekday_alcohol_consumption_vs_address <- mat_por_students %>%
  ggplot(aes(x = factor(weekday.alcohol.consumption), y = final.grade, color = factor(weekday.alcohol.consumption))) +
  geom_jitter(alpha = 0.6) +
  scale_x_discrete("Weekday Alcohol Consumption") +
  scale_y_continuous("Grade") +
  facet_grid(~address) + ggtitle("Consumption of Alcohol based on Geographical Location")
weekend_alcohol_consumption_vs_address <- mat_por_students %>%
  ggplot(aes(x = factor(weekend.alcohol.consumption), y = final.grade, color = factor(weekend.alcohol.consumption))) +
  geom_jitter(alpha = 0.6) +
  scale_x_discrete("Weekend Alcohol Consumption") +
  scale_y_continuous("Grade") +
  facet_grid(~address)

grid.arrange(weekday_alcohol_consumption_vs_address,weekend_alcohol_consumption_vs_address)

Observations:

We can observe from the above that more urban students than rural students appear at the higher alcohol consumption levels, although the data set also contains far more urban students (759 versus 285). Students' grades also tend to decrease gradually as their alcohol consumption increases.

Does age influence student alcohol consumption level on weekdays and weekends?

# age is already stored as an integer, so no conversion is needed
age = mat_por_students$age

# Mean weekday alcohol consumption level by age
barplot(tapply(mat_por_students$weekday.alcohol.consumption, age, mean), main = "Alcohol Consumption on Weekdays", xlab = "age", ylab="mean weekday.alcohol.consumption",border="brown", col="red")

# Mean weekend alcohol consumption level by age
barplot(tapply(mat_por_students$weekend.alcohol.consumption, age, mean), main = "Alcohol Consumption on Weekends", xlab = "age", ylab="mean weekend.alcohol.consumption",border="brown", col="pink")

Students aged 15 to 18 make up 974 of the 1044 observations, so most of the alcohol consumption reported in the data comes from this age group. The means for ages 20 to 22 rest on only 14 students and should be read with caution.
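
To compare consumption across ages, it helps to look at the mean level and the number of students in each age group side by side. The sketch below does this with aggregate() and table() on a small invented data frame; `toy` and its values are assumptions for illustration only.

```r
# Invented data for illustration only
toy <- data.frame(age = c(15, 15, 16, 16, 17, 17, 18),
                  weekday.alcohol.consumption = c(1, 1, 1, 2, 2, 3, 2))

# Mean weekday consumption level for each age
mean_by_age <- aggregate(weekday.alcohol.consumption ~ age, data = toy, FUN = mean)

# Number of students behind each mean, to judge how reliable it is
n_by_age <- table(toy$age)

print(mean_by_age)
print(n_by_age)
```

Reporting the group sizes alongside the means makes it clear when an apparent age effect rests on only a handful of students.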

Does failure rate determine student alcohol consumption?

ggplot(mat_por_students, aes(x=total_alcohol_consumption, y=past.class.failures, color=gender))+
  geom_jitter(alpha=0.9)+ theme_bw()+ xlab("Total alcohol consumption")+
  ylab("Number of past class failures (0 to 3)")+
  ggtitle("Total alcohol consumption vs past failure rate")

Past class failures do not seem to be strongly related to students' alcohol consumption.

Does parents' education affect students' alcohol consumption level?

# Parents' combined education level (mother's plus father's) vs. student alcohol consumption
mat_por_students$parents_education = rowSums(cbind(mat_por_students$mothers.education,mat_por_students$fathers.education))
ggplot(mat_por_students, aes(y = as.numeric(total_alcohol_consumption) , x= parents_education)) + geom_col()

Observation:

We cannot find a clear pattern of association between alcohol consumption and parents' education.

Does alcohol consumption have any effect on student performance?

mat_por_students$weekday.alcohol.consumption.factor <- as.factor(mat_por_students$weekday.alcohol.consumption)
mat_por_students$weekend.alcohol.consumption.factor <- as.factor(mat_por_students$weekend.alcohol.consumption)
weekday_alcohol_consumption <- mat_por_students %>%
  ggplot(aes(x = weekday.alcohol.consumption.factor, y = final.grade, fill = weekday.alcohol.consumption.factor)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Weekday alcohol consumption") +
  ylab("Grade") +
  facet_grid(~gender)
weekend_alcohol_consumption <- mat_por_students %>%
  ggplot(aes(x = weekend.alcohol.consumption.factor, y = final.grade, fill = weekend.alcohol.consumption.factor)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Weekend alcohol consumption") +
  ylab("Grade") +
  facet_grid(~gender)
grid.arrange(weekday_alcohol_consumption, weekend_alcohol_consumption)

Do weekday and weekend alcohol consumption have a negative impact on students' grades?

ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=final.grade)) +
  geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=final.grade)) +
  geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

Observations:

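The fitted lines slope downward: students who report higher alcohol consumption tend to score lower in examinations, although the relationship is weak (correlations of about -0.13 and -0.12).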
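
The line drawn by geom_smooth(method = 'lm') is an ordinary least-squares fit, and its slope can be inspected directly with lm(). The following sketch fits such a line to a small invented data set; `toy`, `level`, and `grade` are illustrative names, not columns of the student data.

```r
# Invented data for illustration only
toy <- data.frame(level = c(1, 1, 2, 2, 3, 3, 4, 5),
                  grade = c(13, 12, 12, 11, 11, 10, 10, 9))

# Ordinary least-squares fit of grade on consumption level
fit <- lm(grade ~ level, data = toy)

# Slope: expected change in grade per one-level increase in consumption
slope <- coef(fit)[["level"]]

# R-squared: share of grade variance explained by the line
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

A negative slope with a small R-squared would match the picture above: a downward trend that explains only a small part of the variation in grades.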

Period grades with respect to weekday alcohol consumption level.

boxplot1 <- ggplot(mat_por_students, aes(x=weekday.alcohol.consumption, y=first.period.grade, fill=weekday.alcohol.consumption))+
  geom_boxplot(aes(group = weekday.alcohol.consumption))+
  theme_test()+
  xlab("Weekday Alcohol consumption")+
  ylab("First period grade")+
  ggtitle("First period grade") + theme(legend.position = "none")

boxplot2 <- ggplot(mat_por_students, aes(x=weekday.alcohol.consumption, y=second.period.grade, fill=weekday.alcohol.consumption))+
  geom_boxplot(aes(group = weekday.alcohol.consumption))+
  theme_dark()+
  xlab("Weekday Alcohol consumption")+
  ylab("Second period grade")+
  ggtitle("Second period grade") + theme(legend.position = "none")

boxplot3 <- ggplot(mat_por_students, aes(x=weekday.alcohol.consumption, y=final.grade, fill=weekday.alcohol.consumption))+
  geom_boxplot(aes(group = weekday.alcohol.consumption))+
  theme_linedraw()+
  xlab("Weekday Alcohol consumption")+
  ylab("Final period grade")+
  ggtitle("Final period grade") + theme(legend.position = "none")

grid.arrange(boxplot1, boxplot2, boxplot3, ncol = 3, top = textGrob("Weekday Alcohol consumption level VS Grades", gp = gpar(fontsize = 10, font = 4)))

Observations:

The boxplots suggest that as weekday alcohol consumption increases, student performance tends to decrease.

Period grades with respect to weekend alcohol consumption level.

boxplot1 <- ggplot(mat_por_students, aes(x=weekend.alcohol.consumption, y=first.period.grade, fill=weekend.alcohol.consumption))+
  geom_boxplot(aes(group = weekend.alcohol.consumption))+
  theme_test()+
  xlab("Weekend Alcohol consumption")+
  ylab("First period grade")+
  ggtitle("First period grade") + theme(legend.position = "none")

boxplot2 <- ggplot(mat_por_students, aes(x=weekend.alcohol.consumption, y=second.period.grade, fill=weekend.alcohol.consumption))+
  geom_boxplot(aes(group = weekend.alcohol.consumption))+
  theme_dark()+
  xlab("Weekend Alcohol consumption")+
  ylab("Second period grade")+
  ggtitle("Second period grade") + theme(legend.position = "none")

boxplot3 <- ggplot(mat_por_students, aes(x=weekend.alcohol.consumption, y=final.grade, fill=weekend.alcohol.consumption))+
  geom_boxplot(aes(group = weekend.alcohol.consumption))+
  theme_linedraw()+
  xlab("Weekend Alcohol consumption")+
  ylab("Final period grade")+
  ggtitle("Final period grade") + theme(legend.position = "none")

grid.arrange(boxplot1, boxplot2, boxplot3, ncol = 3,top=textGrob("Weekend Alcohol consumption level VS Grades ",gp=gpar(fontsize=10,font=4))
)

Observations:

It can be observed that as weekend alcohol consumption increases, student performance tends to decrease.

Does free time increase students' alcohol consumption?
weekly_alc_freetime = ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=freetime)) +
  geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekday alcohol consumption level v.s freetime")

weekend_alc_freetime = ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=freetime)) +
  geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekend alcohol consumption level v.s freetime")

grid.arrange(weekly_alc_freetime, weekend_alc_freetime , ncol = 2,top=textGrob("Alcohol consumption level VS freetime ",gp=gpar(fontsize=10,font=4))
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Observations:

  1. Students with more free time tend to report higher alcohol consumption.

Does going out with friends increase students' tendency to drink alcohol?

weekly_alc_going_out = ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=going.out.with.friends)) +
  geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekday alcohol consumption level v.s going out with friends")

weekend_alc_going_out = ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=going.out.with.friends)) +
  geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekend alcohol consumption level v.s going out with friends")

grid.arrange(weekly_alc_going_out, weekend_alc_going_out , ncol = 2,top=textGrob("Alcohol consumption level VS going out with friends ",gp=gpar(fontsize=10,font=4))
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Observations:

It can be observed that students who go out with friends more often report higher alcohol consumption on both weekdays and weekends.

Is there a correlation between studytime and alcohol consumption?

# Titles corrected: these two plots show studytime, not going out with friends
weekday_alc_study_time = ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=studytime)) +
  geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekday alcohol consumption level v.s studytime")

weekend_alc_study_time = ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=studytime)) +
  geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekend alcohol consumption level v.s studytime")

grid.arrange(weekday_alc_study_time, weekend_alc_study_time , ncol = 2,top=textGrob("Alcohol consumption level VS studytime ",gp=gpar(fontsize=10,font=4))
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Observations:

It can be observed that alcohol consumption and studytime are negatively correlated, i.e. as students' studytime decreases, alcohol consumption increases.
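Since both alcohol consumption and studytime are recorded as small integer levels, a rank-based correlation is one reasonable way to put a number on the direction seen in the plots. The sketch below uses `cor.test()` with the Spearman method on a made-up data frame (`toy` and its values are hypothetical, not taken from the data set).

```r
# Toy ordinal data; column names mirror mat_por_students, values are invented.
toy <- data.frame(
  weekday.alcohol.consumption = c(1, 1, 2, 1, 3, 2, 4, 5, 1, 2),
  studytime                   = c(4, 3, 2, 3, 1, 2, 1, 1, 2, 3)
)

# Spearman rank correlation; exact = FALSE because ordinal data has many ties.
rho_test <- cor.test(toy$weekday.alcohol.consumption, toy$studytime,
                     method = "spearman", exact = FALSE)

print(rho_test$estimate)  # sign gives the direction of the association
print(rho_test$p.value)   # small values argue against "no association"
```

A negative estimate would agree with the reading above. The estimate describes association only and says nothing about which variable drives the other.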

From all the factors above, we found that studytime, freetime, going out with friends, address, family relations and student grades are noticeably associated with the alcohol consumption habits of students. Let us build a linear regression model to describe these relationships.

Observations:

Since all of the p-values are less than 0.05, we reject the null hypothesis that the coefficient is zero for each of these variables, so none of them is irrelevant to alcohol intake on weekdays. The p-values in the weekday model are: intercept 5.82e-05, schoolMS 5.07e-11, genderM < 2e-16, age 4.69e-07, addressU 0.015, studytime 0.022, extra-curricular activities 0.013, freetime 0.017, going out with friends < 2e-16, weekend alcohol consumption < 2e-16.

Linear regression model for weekend alcohol consumption levels.
weekend_alcohol_consumption_survey = lm(weekend.alcohol.consumption~., data= mat_por_students)
coef(weekend_alcohol_consumption_survey)
##                           (Intercept)                              schoolMS 
##                          1.859591e-14                         -4.433036e-15 
##                               genderM                                   age 
##                         -1.260509e-14                         -8.947536e-16 
##                              addressU                        family.sizeLE3 
##                          1.039537e-15                          2.772429e-16 
##          parents.cohabitation.statusT                     mothers.education 
##                         -4.273212e-16                         -4.749574e-17 
##                     fathers.education                     mothers.jobhealth 
##                          4.852776e-18                          3.403729e-17 
##                      mothers.jobother                   mothers.jobservices 
##                         -2.955321e-16                         -1.033102e-15 
##                    mothers.jobteacher                     fathers.jobhealth 
##                         -9.868835e-16                          6.623150e-16 
##                      fathers.jobother                   fathers.jobservices 
##                          1.095695e-15                          5.509421e-16 
##                    fathers.jobteacher       reason.for.school.selectionhome 
##                          5.047604e-16                         -4.735944e-16 
##      reason.for.school.selectionother reason.for.school.selectionreputation 
##                          1.121461e-15                         -5.552352e-16 
##                        guardianmother                         guardianother 
##                          2.226133e-16                         -4.449912e-16 
##                            traveltime                             studytime 
##                         -3.636774e-16                          1.841462e-16 
##                   past.class.failures               extra.school.supportyes 
##                          1.047072e-15                          2.234714e-16 
##           family.education.supportyes                 extra.paid.classesyes 
##                         -1.664347e-16                          1.058920e-15 
##        extra.curricular.activitiesyes            attended.nursery.schoolyes 
##                          9.744521e-16                          4.888939e-16 
##                   higher.educationyes                    internet.accessyes 
##                          2.472096e-16                          4.860300e-16 
##              romantic.relationshipyes       quality.of.family.relationships 
##                         -5.971048e-16                          5.183392e-16 
##                              freetime                going.out.with.friends 
##                          4.724658e-16                         -2.266561e-15 
##           weekday.alcohol.consumption                         health.status 
##                         -1.000000e+00                         -7.019046e-17 
##                       school.absences                    first.period.grade 
##                          1.081548e-17                         -3.012516e-19 
##                   second.period.grade                           final.grade 
##                         -2.105156e-16                          1.942042e-16 
##             total_alcohol_consumption                     parents_education 
##                          1.000000e+00                                    NA 
##   weekday.alcohol.consumption.factor2   weekday.alcohol.consumption.factor3 
##                          8.311543e-16                          1.594434e-17 
##   weekday.alcohol.consumption.factor4   weekday.alcohol.consumption.factor5 
##                         -2.954519e-16                                    NA 
##   weekend.alcohol.consumption.factor2   weekend.alcohol.consumption.factor3 
##                          2.552545e-16                          6.004324e-16 
##   weekend.alcohol.consumption.factor4   weekend.alcohol.consumption.factor5 
##                         -5.463281e-18                                    NA
summary(weekend_alcohol_consumption_survey)
## 
## Call:
## lm(formula = weekend.alcohol.consumption ~ ., data = mat_por_students)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.472e-14 -1.130e-15 -1.800e-17  9.880e-16  1.541e-13 
## 
## Coefficients: (3 not defined because of singularities)
##                                         Estimate Std. Error    t value Pr(>|t|)
## (Intercept)                            1.860e-14  3.766e-15  4.938e+00 9.25e-07
## schoolMS                              -4.433e-15  5.317e-16 -8.338e+00 2.50e-16
## genderM                               -1.261e-14  4.540e-16 -2.777e+01  < 2e-16
## age                                   -8.948e-16  1.866e-16 -4.796e+00 1.87e-06
## addressU                               1.040e-15  4.961e-16  2.095e+00  0.03639
## family.sizeLE3                         2.772e-16  4.477e-16  6.190e-01  0.53588
## parents.cohabitation.statusT          -4.273e-16  6.455e-16 -6.620e-01  0.50814
## mothers.education                     -4.750e-17  2.814e-16 -1.690e-01  0.86602
## fathers.education                      4.853e-18  2.505e-16  1.900e-02  0.98455
## mothers.jobhealth                      3.404e-17  9.907e-16  3.400e-02  0.97260
## mothers.jobother                      -2.955e-16  5.848e-16 -5.050e-01  0.61340
## mothers.jobservices                   -1.033e-15  6.935e-16 -1.490e+00  0.13663
## mothers.jobteacher                    -9.869e-16  9.174e-16 -1.076e+00  0.28229
## fathers.jobhealth                      6.623e-16  1.341e-15  4.940e-01  0.62140
## fathers.jobother                       1.096e-15  8.652e-16  1.266e+00  0.20564
## fathers.jobservices                    5.509e-16  9.052e-16  6.090e-01  0.54288
## fathers.jobteacher                     5.048e-16  1.209e-15  4.180e-01  0.67637
## reason.for.school.selectionhome       -4.736e-16  5.121e-16 -9.250e-01  0.35528
## reason.for.school.selectionother       1.121e-15  6.948e-16  1.614e+00  0.10681
## reason.for.school.selectionreputation -5.552e-16  5.369e-16 -1.034e+00  0.30130
## guardianmother                         2.226e-16  4.902e-16  4.540e-01  0.64981
## guardianother                         -4.450e-16  9.404e-16 -4.730e-01  0.63619
## traveltime                            -3.637e-16  2.975e-16 -1.222e+00  0.22189
## studytime                              1.841e-16  2.599e-16  7.090e-01  0.47876
## past.class.failures                    1.047e-15  3.486e-16  3.004e+00  0.00273
## extra.school.supportyes                2.235e-16  6.577e-16  3.400e-01  0.73408
## family.education.supportyes           -1.664e-16  4.247e-16 -3.920e-01  0.69524
## extra.paid.classesyes                  1.059e-15  4.996e-16  2.120e+00  0.03428
## extra.curricular.activitiesyes         9.745e-16  4.071e-16  2.394e+00  0.01686
## attended.nursery.schoolyes             4.889e-16  4.973e-16  9.830e-01  0.32582
## higher.educationyes                    2.472e-16  7.724e-16  3.200e-01  0.74900
## internet.accessyes                     4.860e-16  5.225e-16  9.300e-01  0.35249
## romantic.relationshipyes              -5.971e-16  4.248e-16 -1.406e+00  0.16013
## quality.of.family.relationships        5.183e-16  2.172e-16  2.386e+00  0.01721
## freetime                               4.725e-16  2.091e-16  2.260e+00  0.02406
## going.out.with.friends                -2.267e-15  1.999e-16 -1.134e+01  < 2e-16
## weekday.alcohol.consumption           -1.000e+00  6.045e-16 -1.654e+15  < 2e-16
## health.status                         -7.019e-17  1.431e-16 -4.900e-01  0.62395
## school.absences                        1.082e-17  3.365e-17  3.210e-01  0.74798
## first.period.grade                    -3.013e-19  1.326e-16 -2.000e-03  0.99819
## second.period.grade                   -2.105e-16  1.668e-16 -1.262e+00  0.20721
## final.grade                            1.942e-16  1.243e-16  1.562e+00  0.11864
## total_alcohol_consumption              1.000e+00  2.893e-16  3.456e+15  < 2e-16
## parents_education                             NA         NA         NA       NA
## weekday.alcohol.consumption.factor2    8.312e-16  6.485e-16  1.282e+00  0.20028
## weekday.alcohol.consumption.factor3    1.594e-17  1.054e-15  1.500e-02  0.98794
## weekday.alcohol.consumption.factor4   -2.955e-16  1.582e-15 -1.870e-01  0.85190
## weekday.alcohol.consumption.factor5           NA         NA         NA       NA
## weekend.alcohol.consumption.factor2    2.553e-16  5.435e-16  4.700e-01  0.63873
## weekend.alcohol.consumption.factor3    6.004e-16  6.865e-16  8.750e-01  0.38198
## weekend.alcohol.consumption.factor4   -5.463e-18  8.757e-16 -6.000e-03  0.99502
## weekend.alcohol.consumption.factor5           NA         NA         NA       NA
##                                          
## (Intercept)                           ***
## schoolMS                              ***
## genderM                               ***
## age                                   ***
## addressU                              *  
## family.sizeLE3                           
## parents.cohabitation.statusT             
## mothers.education                        
## fathers.education                        
## mothers.jobhealth                        
## mothers.jobother                         
## mothers.jobservices                      
## mothers.jobteacher                       
## fathers.jobhealth                        
## fathers.jobother                         
## fathers.jobservices                      
## fathers.jobteacher                       
## reason.for.school.selectionhome          
## reason.for.school.selectionother         
## reason.for.school.selectionreputation    
## guardianmother                           
## guardianother                            
## traveltime                               
## studytime                                
## past.class.failures                   ** 
## extra.school.supportyes                  
## family.education.supportyes              
## extra.paid.classesyes                 *  
## extra.curricular.activitiesyes        *  
## attended.nursery.schoolyes               
## higher.educationyes                      
## internet.accessyes                       
## romantic.relationshipyes                 
## quality.of.family.relationships       *  
## freetime                              *  
## going.out.with.friends                ***
## weekday.alcohol.consumption           ***
## health.status                            
## school.absences                          
## first.period.grade                       
## second.period.grade                      
## final.grade                              
## total_alcohol_consumption             ***
## parents_education                        
## weekday.alcohol.consumption.factor2      
## weekday.alcohol.consumption.factor3      
## weekday.alcohol.consumption.factor4      
## weekday.alcohol.consumption.factor5      
## weekend.alcohol.consumption.factor2      
## weekend.alcohol.consumption.factor3      
## weekend.alcohol.consumption.factor4      
## weekend.alcohol.consumption.factor5      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.16e-15 on 995 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 9.457e+29 on 48 and 995 DF,  p-value: < 2.2e-16

Observations:
  1. Weekday alcohol intake is best explained by how often students go out. Holding all other factors unchanged, a one-unit rise in going out is associated with a 0.175 increase in weekday alcohol intake. With all other variables held constant, weekday alcohol intake is positively associated with frequency of going out, amount of free time, and health status, while the quality of family relationships is negatively associated with it. In other words, the more students went out, the more free time they had and the healthier they were, the more alcohol they drank; the better their family relationships were, the less they drank.

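  2. For weekend alcohol consumption, the fit above is exact (R-squared of 1, residuals around 1e-14) because total_alcohol_consumption is the sum of weekday and weekend consumption. The model simply recovers weekend consumption as total minus weekday (coefficients 1 and -1), and every other coefficient is of the order of 1e-15. The importance of family relationships, studytime and going out with friends therefore cannot be judged from this fit.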
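The weekend fit reports an R-squared of 1 with coefficients of 1 and -1 on total_alcohol_consumption and weekday.alcohol.consumption, which is what happens when a predictor is built from the response. The sketch below reproduces the effect on simulated data and shows one possible remedy, dropping the derived columns before fitting. The data frame `toy`, the simulation and the vector `derived` are assumptions for illustration; in the real data frame the columns to drop would presumably be total_alcohol_consumption and the two `.factor` copies of the consumption variables.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the survey; values are invented.
toy <- data.frame(
  going.out.with.friends      = sample(1:5, n, replace = TRUE),
  weekday.alcohol.consumption = sample(1:5, n, replace = TRUE)
)
toy$weekend.alcohol.consumption <- pmin(5, pmax(1, round(
  0.5 * toy$going.out.with.friends +
  0.3 * toy$weekday.alcohol.consumption + rnorm(n))))

# Derived column: the response is hidden inside this predictor.
toy$total_alcohol_consumption <-
  toy$weekday.alcohol.consumption + toy$weekend.alcohol.consumption

# Leaky fit: weekend = total - weekday holds exactly, so the fit is perfect.
leaky <- lm(weekend.alcohol.consumption ~ ., data = toy)
# summary() warns about an essentially perfect fit; the warning is the symptom.
leaky_r2 <- suppressWarnings(summary(leaky)$r.squared)

# Clean fit: remove columns computed from the response before fitting.
derived <- c("total_alcohol_consumption")
clean <- lm(weekend.alcohol.consumption ~ .,
            data = toy[, !(names(toy) %in% derived)])
clean_r2 <- summary(clean)$r.squared

print(c(leaky = leaky_r2, clean = clean_r2))
print(round(coef(leaky), 6))
```

In the clean fit the coefficients and p-values of the remaining predictors become interpretable again, at the cost of a much lower R-squared.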

Q2. Identify the Factors Impacting Student Performance.

Male vs female students with past class failures

# Bar height is the sum of past.class.failures within each gender
ggplot(mat_por_students,aes(x=gender,y=past.class.failures,fill=gender))+
  geom_bar(stat="identity")+ggtitle("Total past class failures of males versus females")

table(mat_por_students$past.class.failures,mat_por_students$gender)
##    
##       F   M
##   0 497 364
##   1  65  55
##   2  18  15
##   3  11  19
table(mat_por_students$age,mat_por_students$past.class.failures)
##     
##        0   1   2   3
##   15 179   7   5   3
##   16 249  22   7   3
##   17 237  27   3  10
##   18 175  35   7   5
##   19  18  26   6   6
##   20   3   3   3   0
##   21   0   0   2   1
##   22   0   0   0   2

Barplot to describe past failures of students in school.

ggplot(aes(x = past.class.failures), data = mat_por_students )  + geom_bar(aes(fill = gender)) + ggtitle("Past failures of students")

Impact of failures on final grades of students

plot(x=mat_por_students$past.class.failures, y=mat_por_students$final.grade, xlab = "Failures", ylab = "Final grade", col = "green")

cor(mat_por_students$final.grade, mat_por_students$past.class.failures)
## [1] -0.3831453

Observations:

  1. From the above, we can see that female students have fewer past failures than male students.
  2. Most students have not failed in their previous examinations.
  3. There is a negative correlation between past failures and final grades, i.e. the more failures a student has, the lower their grade tends to be.
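
Because there are more female than male students in the table, the comparison between genders is easier to read as proportions. The sketch below re-enters the counts from the past.class.failures by gender table above and converts them to column shares; the object names are illustrative only.

```r
# Counts copied from the past.class.failures x gender table above.
failures_by_gender <- matrix(
  c(497, 364,
     65,  55,
     18,  15,
     11,  19),
  ncol = 2, byrow = TRUE,
  dimnames = list(failures = as.character(0:3), gender = c("F", "M"))
)

# Share of each gender at each failure count (each column sums to 1).
col_share <- prop.table(failures_by_gender, margin = 2)
print(round(col_share, 3))

# Share of each gender with at least one past failure.
at_least_one <- 1 - col_share["0", ]
print(round(at_least_one, 3))
```

This gives 94 of 591 female students (about 16%) and 89 of 453 male students (about 20%) with at least one past failure.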
Does extra-school support help students get good grades in school?
ggplot(data = mat_por_students,aes(x = extra.school.support,y=final.grade,fill=extra.school.support))+
  geom_boxplot(show.legend = F) + labs(x="Extra Educational Support",y="Final Score")+ ggtitle("Extra Educational Support and Final Grade")

Observations:

Surprisingly, the above diagram shows that students who received extra school support scored lower than students who did not receive it.

Boxplot used to analyse the relation between type of address and final grades of the students

ggplot(data= mat_por_students, aes(x=address, y=final.grade,fill=address))+
  geom_boxplot(show.legend = F) + labs(x="Address U: Urban, R: Rural",y="Final Score") + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))+ ggtitle("Rural/Urban Living Area in relation to Final Grades")

Observation:

In comparison to rural students, urban students appear to perform better.

Does internet access at home help students perform better in their exams?

mat_por_students%>% 
  group_by(internet.access)%>%
  ggplot(aes(x=final.grade, fill=internet.access))+
  geom_density(alpha=0.5)+ ggtitle("Internet Access and Final Grade")

Observation:

  1. Students with internet access at home tend to achieve better grades.
  2. Most of the students have an internet connection at home, and failures are also less frequent among them.

Average Final Score across different weekend and weekday alcohol consumption levels

mat_por_students%>% group_by(weekday.alcohol.consumption)%>% aggregate(final.grade~weekday.alcohol.consumption, data=., mean)%>%
  arrange(desc(final.grade))
mat_por_students%>%
   group_by(weekend.alcohol.consumption)%>%
   aggregate(final.grade~weekend.alcohol.consumption, data=., mean)%>%
   arrange(desc(final.grade))

Observations:

  1. We can conclude from the tables above that final grades tend to decrease as alcohol consumption increases, on both weekdays and weekends.

Does student’s relationship status have any effect on their grades?

ggplot(data = mat_por_students,aes(x=final.grade))+ geom_bar(aes(fill=romantic.relationship),alpha=.9)+ labs(y="Count",x="Final Grade")

Observation:

The number of students in a romantic relationship is lower than the number of students who are not. The distribution of final grades is very similar for both groups in the diagram above, but the highest final grades appear to belong to students who are not in a romantic relationship.

Relationship between going out and average grade of the students

mat_por_students$going.out.with.friends.factor <- as.factor(mat_por_students$going.out.with.friends)
mat_por_students%>%
   group_by(going.out.with.friends.factor)%>%
   summarise(AverageScore= mean(final.grade))%>%
   arrange(desc(going.out.with.friends.factor))

Observation:

It is observed that going out with friends has an impact on the average score of students. As the going-out level rises above 3, the average score decreases.

Do students who desire to take higher education, perform better in exam?

ggplot(data = mat_por_students,aes(x=final.grade))+
  geom_density(aes(fill=higher.education))+ labs(y="Density",x="Final Score")+ scale_fill_grey() + theme_classic()+ ggtitle("Student Performance and the want for Higher Education")

Observations:

It can be observed that students who wish to pursue higher education perform better in their examinations than the other students.

Studytime versus final grade

plot(x=mat_por_students$studytime, y=mat_por_students$final.grade, xlab = "Study time", ylab = "Final grade", col = "brown")

Correlation between Study time and Final grade of students.

cor(mat_por_students$studytime,mat_por_students$final.grade)
## [1] 0.1616289

From the above graph and correlation, we can see that study time has some positive association with the final grade, but it is not very strong.
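
To make "not very strong" concrete, the correlation can be squared to give the share of variance in the final grade that a straight-line relationship with study time would account for. This small worked calculation uses only the coefficient reported above; the variable names are illustrative.

```r
# Correlation between studytime and final.grade, as reported above.
r <- 0.1616289

# Squared correlation: share of variance explained by a simple linear fit.
r_squared <- r^2
print(round(r_squared, 4))            # 0.0261
print(round(100 * r_squared, 1))      # about 2.6 percent
```

So study time alone would account for roughly 2.6% of the variation in final grades.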

Impact of mother’s job on final grades
# Pie chart of the number of students per mother's job (geom_bar counts rows by default)
ggplot(mat_por_students, aes(x="", fill=mothers.job)) +
  geom_bar(width=1) +
  coord_polar("y", start=0)

ggplot(data= mat_por_students, aes(x=mothers.job, y=final.grade,fill=mothers.job))+
 geom_boxplot(show.legend = F)

Observations:

  1. It can be observed that students whose mothers work in the health sector or as teachers perform better on average than other students.
  2. Students whose mothers are at home score lower than other students.

How does travel time affect student performance?

ggplot(mat_por_students, aes(x=traveltime, y=final.grade)) + 
  geom_point()+ geom_density_2d()+
  geom_smooth(method=lm)+ ggtitle("Travel Time to School and Final Grade")
## `geom_smooth()` using formula 'y ~ x'

Observations:

  1. It can be observed that students with shorter travel times tend to perform better in school.

Do extra paid classes help students perform better in school?

mat_por_students%>% group_by(extra.paid.classes)%>% aggregate(final.grade~extra.paid.classes, data=., mean)%>%
  arrange(desc(final.grade))

Observations:

Surprisingly, students who haven't taken extra paid classes perform better than those who have.
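
A difference between two group means can arise by chance, so it is worth pairing the comparison with a test. The sketch below applies Welch's two-sample t-test via `t.test()` to a made-up data frame (`toy` and its grades are hypothetical); whether a t-test suits the real grade distribution is an assumption that would need checking.

```r
# Toy data: 8 students without and 8 with extra paid classes (invented grades).
toy <- data.frame(
  extra.paid.classes = factor(rep(c("no", "yes"), each = 8)),
  final.grade = c(12, 10, 14, 11, 13, 9, 15, 12,
                  11, 10, 13, 12, 9, 11, 14, 10)
)

# Mean final grade per group, the same quantity as the aggregate() call above.
group_means <- tapply(toy$final.grade, toy$extra.paid.classes, mean)
print(group_means)

# Welch two-sample t-test (does not assume equal variances).
welch <- t.test(final.grade ~ extra.paid.classes, data = toy)
print(welch$p.value)
print(welch$conf.int)  # confidence interval for mean("no") - mean("yes")
```

A confidence interval that includes zero would mean the observed gap between the two groups is compatible with no real difference.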

Divide the data into training and testing datasets.

80% of the data will form the training dataset and 20% the testing dataset.

set.seed (199)
trainindex=sample(nrow(mat_por_students),nrow(mat_por_students)*.8)
train_data <- mat_por_students[trainindex, ]
test_data <- mat_por_students[-trainindex, ]
str(train_data)
## 'data.frame':    835 obs. of  38 variables:
##  $ school                            : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 2 1 2 ...
##  $ gender                            : Factor w/ 2 levels "F","M": 2 2 1 1 2 2 1 2 2 1 ...
##  $ age                               : int  16 17 18 17 15 18 18 17 18 19 ...
##  $ address                           : Factor w/ 2 levels "R","U": 2 1 2 1 2 2 2 2 2 1 ...
##  $ family.size                       : Factor w/ 2 levels "GT3","LE3": 1 1 2 2 1 1 1 2 1 1 ...
##  $ parents.cohabitation.status       : Factor w/ 2 levels "A","T": 2 2 1 2 2 2 2 2 2 2 ...
##  $ mothers.education                 : int  4 2 4 3 4 2 4 3 4 2 ...
##  $ fathers.education                 : int  4 2 4 1 3 1 4 1 2 3 ...
##  $ mothers.job                       : Factor w/ 5 levels "at_home","health",..: 5 3 2 4 5 4 2 4 5 4 ...
##  $ fathers.job                       : Factor w/ 5 levels "at_home","health",..: 5 4 3 3 3 4 2 4 3 3 ...
##  $ reason.for.school.selection       : Factor w/ 4 levels "course","home",..: 2 3 2 4 1 3 4 1 2 1 ...
##  $ guardian                          : Factor w/ 3 levels "father","mother",..: 2 2 2 2 2 2 1 2 2 2 ...
##  $ traveltime                        : int  1 2 1 2 2 1 1 2 1 1 ...
##  $ studytime                         : int  2 1 2 4 2 1 2 1 2 3 ...
##  $ past.class.failures               : int  0 0 0 0 0 1 1 0 0 1 ...
##  $ extra.school.support              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ family.education.support          : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 2 1 2 1 ...
##  $ extra.paid.classes                : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 1 1 2 1 ...
##  $ extra.curricular.activities       : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 1 2 2 ...
##  $ attended.nursery.school           : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 2 1 2 1 ...
##  $ higher.education                  : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 2 2 2 2 ...
##  $ internet.access                   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 2 2 2 2 2 ...
##  $ romantic.relationship             : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 1 2 1 ...
##  $ quality.of.family.relationships   : int  4 5 4 3 5 3 2 2 4 5 ...
##  $ freetime                          : int  4 2 2 1 4 2 4 4 3 4 ...
##  $ going.out.with.friends            : int  5 2 4 2 3 5 4 5 2 2 ...
##  $ weekday.alcohol.consumption       : int  5 1 1 1 1 2 1 3 1 1 ...
##  $ weekend.alcohol.consumption       : int  5 1 1 1 2 5 1 4 4 2 ...
##  $ health.status                     : int  5 4 4 3 3 5 4 2 5 5 ...
##  $ school.absences                   : int  16 0 0 6 2 4 2 6 11 0 ...
##  $ first.period.grade                : int  10 9 14 18 10 6 14 10 12 7 ...
##  $ second.period.grade               : int  12 10 15 18 10 9 12 10 11 5 ...
##  $ final.grade                       : int  11 10 15 18 11 8 13 10 11 0 ...
##  $ total_alcohol_consumption         : num  10 2 2 2 3 7 2 7 5 3 ...
##  $ parents_education                 : num  8 4 8 4 7 3 8 4 6 5 ...
##  $ weekday.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 5 1 1 1 1 2 1 3 1 1 ...
##  $ weekend.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 5 1 1 1 2 5 1 4 4 2 ...
##  $ going.out.with.friends.factor     : Factor w/ 5 levels "1","2","3","4",..: 5 2 4 2 3 5 4 5 2 2 ...
str(test_data)
## 'data.frame':    209 obs. of  38 variables:
##  $ school                            : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender                            : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 1 2 1 1 ...
##  $ age                               : int  18 16 16 16 16 15 16 16 16 15 ...
##  $ address                           : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ family.size                       : Factor w/ 2 levels "GT3","LE3": 1 2 1 2 2 1 2 1 1 1 ...
##  $ parents.cohabitation.status       : Factor w/ 2 levels "A","T": 1 2 2 2 1 2 1 2 1 1 ...
##  $ mothers.education                 : int  4 2 4 2 3 4 3 4 2 4 ...
##  $ fathers.education                 : int  4 2 4 2 4 4 3 3 1 3 ...
##  $ mothers.job                       : Factor w/ 5 levels "at_home","health",..: 1 3 4 3 4 4 3 2 3 4 ...
##  $ fathers.job                       : Factor w/ 5 levels "at_home","health",..: 5 3 4 3 3 4 4 4 3 4 ...
##  $ reason.for.school.selection       : Factor w/ 4 levels "course","home",..: 1 2 4 4 2 4 2 4 3 4 ...
##  $ guardian                          : Factor w/ 3 levels "father","mother",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ traveltime                        : int  2 1 1 2 1 2 1 1 1 1 ...
##  $ studytime                         : int  2 2 3 2 2 2 2 4 2 2 ...
##  $ past.class.failures               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ extra.school.support              : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ family.education.support          : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 1 2 ...
##  $ extra.paid.classes                : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 2 2 ...
##  $ extra.curricular.activities       : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 2 2 2 ...
##  $ attended.nursery.school           : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ higher.education                  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet.access                   : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ romantic.relationship             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ quality.of.family.relationships   : int  4 4 3 5 5 4 2 4 5 4 ...
##  $ freetime                          : int  3 4 2 4 3 3 3 2 3 3 ...
##  $ going.out.with.friends            : int  4 4 3 4 3 1 5 2 4 2 ...
##  $ weekday.alcohol.consumption       : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ weekend.alcohol.consumption       : int  1 1 2 4 1 1 4 1 1 1 ...
##  $ health.status                     : int  3 3 2 5 5 5 3 2 2 1 ...
##  $ school.absences                   : int  6 0 6 0 4 0 12 4 8 0 ...
##  $ first.period.grade                : int  5 12 13 13 11 17 11 19 8 14 ...
##  $ second.period.grade               : int  6 12 14 13 11 16 12 19 9 15 ...
##  $ final.grade                       : int  6 11 14 12 11 17 11 20 10 15 ...
##  $ total_alcohol_consumption         : num  2 2 3 6 2 2 5 2 2 2 ...
##  $ parents_education                 : num  8 4 8 4 7 8 6 7 3 7 ...
##  $ weekday.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 2 1 1 1 1 1 1 ...
##  $ weekend.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 1 1 2 4 1 1 4 1 1 1 ...
##  $ going.out.with.friends.factor     : Factor w/ 5 levels "1","2","3","4",..: 4 4 3 4 3 1 5 2 4 2 ...
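
The split above passes a fractional size (1044 * 0.8 = 835.2) to `sample()`, and the output shows 835 training rows, consistent with the size being truncated. A minimal sketch of the same 80/20 split with the rounding made explicit, on a made-up data frame of the same length (`toy`, `id` and the simulated grades are hypothetical):

```r
set.seed(199)

# Toy data frame with as many rows as the merged data set (1044).
toy <- data.frame(id = seq_len(1044),
                  final.grade = sample(0:20, 1044, replace = TRUE))

# Explicit integer training size: floor(0.8 * 1044) = 835.
n_train <- floor(0.8 * nrow(toy))

# Draw training row indices without replacement; the rest form the test set.
train_index <- sample(nrow(toy), n_train)
toy_train <- toy[train_index, ]
toy_test  <- toy[-train_index, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))

# Sanity check: no row may appear in both sets.
overlap <- length(intersect(toy_train$id, toy_test$id))
print(overlap)
```

Checking that the two sets are disjoint and together cover every row is a cheap guard against indexing mistakes.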

Linear regression model to understand how previous period grades contribute to the final grade of a student.

student_grade_model_1 = lm(final.grade ~ first.period.grade + second.period.grade, data = train_data)
summary(student_grade_model_1)
## 
## Call:
## lm(formula = final.grade ~ first.period.grade + second.period.grade, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9324 -0.3781  0.1006  0.8585  6.0676 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.09930    0.21364  -5.146 3.33e-07 ***
## first.period.grade   0.13619    0.03572   3.812 0.000148 ***
## second.period.grade  0.96698    0.03231  29.930  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.587 on 832 degrees of freedom
## Multiple R-squared:  0.8338, Adjusted R-squared:  0.8334 
## F-statistic:  2087 on 2 and 832 DF,  p-value: < 2.2e-16

Observations:

In the above model, the overall p-value is < 2.2e-16, which is smaller than 0.05. This means that at least one of the independent variables is significantly associated with the dependent variable, i.e. the final grade of students. Both first period grade and second period grade are significantly associated with the final grade, and together they explain about 83% of its variance (R-squared 0.8338). We can conclude that students who perform well in the first and second periods tend to achieve a better final grade.
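
The summary above describes the fit on the training data only. One way to use the held-out rows is to predict their final grades and compute the root mean squared error (RMSE). The sketch below does this on simulated grades (`toy`, the simulation and the mean-only baseline are assumptions for illustration, not results from the data set).

```r
set.seed(42)
n <- 300

# Simulated grades on a 0-20 scale; each period loosely follows the previous one.
first  <- pmin(20, pmax(0, round(rnorm(n, 11, 3))))
second <- pmin(20, pmax(0, round(first  + rnorm(n, 0, 1.5))))
final  <- pmin(20, pmax(0, round(second + rnorm(n, 0, 1.5))))
toy <- data.frame(first.period.grade  = first,
                  second.period.grade = second,
                  final.grade         = final)

# 80/20 split, mirroring the approach used for train_data and test_data.
idx <- sample(n, floor(0.8 * n))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Fit on the training rows, predict the unseen test rows.
fit  <- lm(final.grade ~ first.period.grade + second.period.grade,
           data = toy_train)
pred <- predict(fit, newdata = toy_test)

# RMSE of the model versus a baseline that always predicts the training mean.
rmse          <- sqrt(mean((toy_test$final.grade - pred)^2))
baseline_rmse <- sqrt(mean((toy_test$final.grade -
                            mean(toy_train$final.grade))^2))
print(c(model = rmse, baseline = baseline_rmse))
```

With the real data, `train_data` and `test_data` would take the place of the toy split, and a test RMSE close to the residual standard error of the training fit would suggest the model generalises.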

Linear regression model to understand the relation between total alcohol consumption, study time and final grades of students.

studytime_walc_model = lm(final.grade ~ studytime + total_alcohol_consumption , data = train_data)
summary(studytime_walc_model)
## 
## Call:
## lm(formula = final.grade ~ studytime + total_alcohol_consumption, 
##     data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9759  -1.5898   0.2649   2.4184   8.1032 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                10.5274     0.4692  22.435  < 2e-16 ***
## studytime                   0.6930     0.1605   4.317 1.77e-05 ***
## total_alcohol_consumption  -0.1618     0.0686  -2.358   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.823 on 832 degrees of freedom
## Multiple R-squared:  0.03495,    Adjusted R-squared:  0.03263 
## F-statistic: 15.07 on 2 and 832 DF,  p-value: 3.739e-07
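
Linear regression model to understand the relation between past class failures, study time, school and social factors, and final grades of students.

grade_model <- lm(final.grade ~ past.class.failures + studytime + higher.education + extra.school.support + internet.access + going.out.with.friends + romantic.relationship, data = train_data)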
summary(grade_model)
## 
## Call:
## lm(formula = final.grade ~ past.class.failures + studytime + 
##     higher.education + extra.school.support + internet.access + 
##     going.out.with.friends + romantic.relationship, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9535  -1.5050   0.3338   2.1313   7.6352 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               9.63584    0.65069  14.809  < 2e-16 ***
## past.class.failures      -1.98850    0.19194 -10.360  < 2e-16 ***
## studytime                 0.47060    0.14798   3.180  0.00153 ** 
## higher.educationyes       1.45080    0.45190   3.210  0.00138 ** 
## extra.school.supportyes  -1.25310    0.39132  -3.202  0.00142 ** 
## internet.accessyes        0.67877    0.30025   2.261  0.02404 *  
## going.out.with.friends   -0.07181    0.10738  -0.669  0.50386    
## romantic.relationshipyes -0.62254    0.25592  -2.433  0.01520 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.489 on 827 degrees of freedom
## Multiple R-squared:  0.201,  Adjusted R-squared:  0.1942 
## F-statistic: 29.71 on 7 and 827 DF,  p-value: < 2.2e-16

Linear regression model for all the independent variables with final grade obtained by students.

model =lm(final.grade~., data= train_data)
summary(model)
## 
## Call:
## lm(formula = final.grade ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9156 -0.5030  0.1236  0.7887  5.6795 
## 
## Coefficients: (5 not defined because of singularities)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           -0.4663462  1.1071484  -0.421  0.67371
## schoolMS                               0.1852471  0.1517368   1.221  0.22251
## genderM                               -0.0120514  0.1312554  -0.092  0.92687
## age                                   -0.0530898  0.0524889  -1.011  0.31211
## addressU                               0.1663470  0.1404447   1.184  0.23660
## family.sizeLE3                         0.0276404  0.1279574   0.216  0.82903
## parents.cohabitation.statusT          -0.0898500  0.1844827  -0.487  0.62637
## mothers.education                      0.0010577  0.0797069   0.013  0.98942
## fathers.education                     -0.0394428  0.0712009  -0.554  0.57976
## mothers.jobhealth                      0.0499081  0.2839346   0.176  0.86052
## mothers.jobother                      -0.0223596  0.1713440  -0.130  0.89621
## mothers.jobservices                    0.0743007  0.2027081   0.367  0.71406
## mothers.jobteacher                     0.1035629  0.2681834   0.386  0.69948
## fathers.jobhealth                     -0.1215383  0.3787235  -0.321  0.74836
## fathers.jobother                      -0.2184594  0.2434501  -0.897  0.36981
## fathers.jobservices                   -0.3523352  0.2558579  -1.377  0.16888
## fathers.jobteacher                    -0.3884929  0.3502331  -1.109  0.26767
## reason.for.school.selectionhome       -0.1458425  0.1472554  -0.990  0.32228
## reason.for.school.selectionother      -0.2348689  0.1955910  -1.201  0.23018
## reason.for.school.selectionreputation -0.0564394  0.1542963  -0.366  0.71462
## guardianmother                         0.1638691  0.1400655   1.170  0.24238
## guardianother                          0.1643582  0.2670234   0.616  0.53839
## traveltime                             0.0741328  0.0875349   0.847  0.39731
## studytime                             -0.0457891  0.0741487  -0.618  0.53706
## past.class.failures                   -0.2540944  0.0992845  -2.559  0.01068
## extra.school.supportyes                0.0847372  0.1929170   0.439  0.66061
## family.education.supportyes            0.2939769  0.1233044   2.384  0.01736
## extra.paid.classesyes                 -0.2774266  0.1429470  -1.941  0.05264
## extra.curricular.activitiesyes        -0.2758952  0.1176846  -2.344  0.01931
## attended.nursery.schoolyes            -0.1215181  0.1424381  -0.853  0.39385
## higher.educationyes                   -0.2254339  0.2185704  -1.031  0.30267
## internet.accessyes                    -0.0070013  0.1479869  -0.047  0.96228
## romantic.relationshipyes              -0.0643190  0.1223258  -0.526  0.59918
## quality.of.family.relationships        0.0816132  0.0625098   1.306  0.19207
## freetime                               0.0007795  0.0603866   0.013  0.98970
## going.out.with.friends                 0.0718472  0.0701772   1.024  0.30625
## weekday.alcohol.consumption           -0.1287653  0.1137774  -1.132  0.25809
## weekend.alcohol.consumption            0.1456309  0.0853507   1.706  0.08835
## health.status                         -0.0072131  0.0411645  -0.175  0.86095
## school.absences                        0.0286119  0.0091740   3.119  0.00188
## first.period.grade                     0.1544142  0.0375032   4.117 4.24e-05
## second.period.grade                    0.9455785  0.0331597  28.516  < 2e-16
## total_alcohol_consumption                     NA         NA      NA       NA
## parents_education                             NA         NA      NA       NA
## weekday.alcohol.consumption.factor2   -0.2259466  0.1856214  -1.217  0.22388
## weekday.alcohol.consumption.factor3    0.1916335  0.3049241   0.628  0.52988
## weekday.alcohol.consumption.factor4   -0.1956194  0.4455736  -0.439  0.66076
## weekday.alcohol.consumption.factor5           NA         NA      NA       NA
## weekend.alcohol.consumption.factor2   -0.2448370  0.1592861  -1.537  0.12467
## weekend.alcohol.consumption.factor3   -0.1745679  0.2006880  -0.870  0.38465
## weekend.alcohol.consumption.factor4   -0.0904112  0.2567847  -0.352  0.72487
## weekend.alcohol.consumption.factor5           NA         NA      NA       NA
## going.out.with.friends.factor2         0.1850878  0.2071372   0.894  0.37184
## going.out.with.friends.factor3         0.3359318  0.1690766   1.987  0.04729
## going.out.with.friends.factor4         0.1250463  0.1782406   0.702  0.48316
## going.out.with.friends.factor5                NA         NA      NA       NA
##                                          
## (Intercept)                              
## schoolMS                                 
## genderM                                  
## age                                      
## addressU                                 
## family.sizeLE3                           
## parents.cohabitation.statusT             
## mothers.education                        
## fathers.education                        
## mothers.jobhealth                        
## mothers.jobother                         
## mothers.jobservices                      
## mothers.jobteacher                       
## fathers.jobhealth                        
## fathers.jobother                         
## fathers.jobservices                      
## fathers.jobteacher                       
## reason.for.school.selectionhome          
## reason.for.school.selectionother         
## reason.for.school.selectionreputation    
## guardianmother                           
## guardianother                            
## traveltime                               
## studytime                                
## past.class.failures                   *  
## extra.school.supportyes                  
## family.education.supportyes           *  
## extra.paid.classesyes                 .  
## extra.curricular.activitiesyes        *  
## attended.nursery.schoolyes               
## higher.educationyes                      
## internet.accessyes                       
## romantic.relationshipyes                 
## quality.of.family.relationships          
## freetime                                 
## going.out.with.friends                   
## weekday.alcohol.consumption              
## weekend.alcohol.consumption           .  
## health.status                            
## school.absences                       ** 
## first.period.grade                    ***
## second.period.grade                   ***
## total_alcohol_consumption                
## parents_education                        
## weekday.alcohol.consumption.factor2      
## weekday.alcohol.consumption.factor3      
## weekday.alcohol.consumption.factor4      
## weekday.alcohol.consumption.factor5      
## weekend.alcohol.consumption.factor2      
## weekend.alcohol.consumption.factor3      
## weekend.alcohol.consumption.factor4      
## weekend.alcohol.consumption.factor5      
## going.out.with.friends.factor2           
## going.out.with.friends.factor3        *  
## going.out.with.friends.factor4           
## going.out.with.friends.factor5           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.571 on 784 degrees of freedom
## Multiple R-squared:  0.8465, Adjusted R-squared:  0.8367 
## F-statistic: 86.46 on 50 and 784 DF,  p-value: < 2.2e-16

Observations:

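  1. From the above regression model, we get a multiple R-squared of 0.8465 (adjusted 0.8367), i.e. the model explains about 85% of the variance in the final grade on the training data.
  2. The p-value is less than 0.05 for the variables: past.class.failures, family.education.supportyes, extra.curricular.activitiesyes, school.absences, first.period.grade, second.period.grade and going.out.with.friends.factor3. The variables extra.paid.classesyes and weekend.alcohol.consumption are significant only at the 0.1 level, while traveltime and weekday.alcohol.consumption are not significant.
  3. Five coefficients are reported as NA because of singularities: total_alcohol_consumption, parents_education and the last level of each of the three factor columns are exact linear combinations of other predictors already in the model.
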
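
The NA rows in the coefficient table arise when a predictor is an exact linear combination of other predictors. The sketch below reproduces the effect on simulated data. The column names are hypothetical, and it assumes that the total is built as weekday plus weekend consumption, which is consistent with the values printed in the str() output above.

```r
# Simulated data in which `total` is an exact sum of two other columns
set.seed(2)
n <- 100
weekday <- sample(1:5, n, replace = TRUE)
weekend <- sample(1:5, n, replace = TRUE)
total   <- weekday + weekend
grade   <- 12 - 0.3 * weekday + rnorm(n)
sim <- data.frame(grade, weekday, weekend, total)

# Using every column: lm() cannot estimate the redundant term and returns NA
fit_all <- lm(grade ~ ., data = sim)
aliased <- names(which(is.na(coef(fit_all))))
aliased

# Dropping the derived column gives a full-rank fit with no NA coefficients
fit_clean <- lm(grade ~ weekday + weekend, data = sim)
coef(fit_clean)
```

Removing derived columns before fitting would also avoid the rank-deficient warning that predict() raises for such a model.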
test_data$final.grade.predicted <- predict(object = model,newdata = test_data)
## Warning in predict.lm(object = model, newdata = test_data): prediction from a
## rank-deficient fit may be misleading
test_data$final.grade
##   [1]  6 11 14 12 11 17 11 20 10 15 16 11 10 12 10  7 17 10 13 16 14  8 11  0 12
##  [26] 15 11 12 13  0 16 12 15 13  9  8 10 16 10 18 16  9  7  8  4  6 17  7  9 12
##  [51] 14 13 11  0 12  0 12 13 14  9  0 17 10 10 11 14 12  8 10  9 13  8 10 10 19
##  [76] 14 10 14 13 14  7 14 12 12 11 11 11  9 12 16 16 12 10 11 14 12 11 12 13 18
## [101] 14 11  9 11 12 11 11 11 13  6  0  8 11 11  9 18 13 10 11 16 13 11 15 13 11
## [126] 14 16 11 16 13 14 10  8 16 13 11 10  6 12 13 14 12 12 12 10 14 12 15 17 13
## [151] 18 15 17 14 17 14 10 10 15 12 17 14 12 15 18 11  9 11 13 11 13 10 12 12 16
## [176] 10 14 10  9 13 10 18 10 15 16  9 10 11 12 10 10  9  8 10 11  8  0 18 14  0
## [201] 12  0 14 10 12  9 17 12 11
test_data$final.grade.predicted
##   [1]  5.51435922 12.04248661 13.87829686 13.38305348 11.47388837 16.33415909
##   [7] 12.58430917 19.37423819  8.31207301 15.00091047 16.99132618  8.60100485
##  [13]  7.00453515 11.50323561  9.63939403  7.53157761 16.99715554  8.90528897
##  [19] 12.23039537 15.21920287 12.84052185  6.40678958  9.21255185  0.39103735
##  [25] 12.42305850 16.25008744 11.08915832 10.79067920 12.59866980  6.52945072
##  [31] 14.89648738 13.06522262 14.93395628 12.37325854  7.59052521  7.17811429
##  [37]  8.88240122 15.68611624  9.09766871 19.03774662 17.07287244  9.49757712
##  [43]  5.81101529  6.24403188  5.07683851  5.20588229 15.86239306  7.71485884
##  [49]  8.35302364 10.28869469 13.07892783 12.67485593 10.52050829  6.77793330
##  [55] 12.42107443 -0.07667694 11.42403586 12.34485628 14.22914401  8.75979456
##  [61]  9.22360809 17.79834397 10.13989354  8.57653330  9.96767351 14.04960534
##  [67] 12.41016695  7.74524413 10.40559414  9.62553103 13.33738981  7.89813577
##  [73]  9.97679192 10.07623471 18.85864501 14.11672951 10.70432290 13.94720232
##  [79] 12.26530368 13.32459251  6.75939215 12.74850964 11.14044690 10.73900724
##  [85] 11.71466612 10.69992549 10.90908180  9.49499387 12.35592454 15.76799881
##  [91] 15.96893484 11.82671328  9.47609935 11.42461306 12.55047438 10.93227402
##  [97] 11.81145284 12.19988143 13.71723402 18.19125607 14.40584575 11.08418696
## [103]  8.52645962  9.67929636 10.19445120 11.59807827 11.19033155 10.76600551
## [109] 13.17097718  7.69407365  8.83149384  6.89493058 10.26759661 10.37814801
## [115]  6.69885002 17.47108973 12.85661308  9.15631513  9.25536133 15.37096320
## [121] 12.41322511 10.08938602 14.63141807 13.67690888  9.70304345 12.50519116
## [127] 16.37874342 10.02385032 15.19371085 12.74672886 13.05858905 10.38416455
## [133]  7.64423230 16.97107188 11.62180458 11.82950832 11.53252194  7.34349680
## [139]  9.90224275 11.70102007 10.91745544 10.20374185 11.99970642 12.55170606
## [145]  9.30325886 12.65938642 12.07903337 14.94652489 17.92284489 11.83295361
## [151] 18.60033508 12.48719417 15.26370874 12.25278529 15.15195188 11.19471086
## [157] 10.61877944  8.84222069 14.57577590 10.61733111 14.09238214 13.35734987
## [163] 10.05119713 14.30080175 18.92729907 11.40754467  8.80591722  9.75220173
## [169] 13.12376768 10.33351858 10.65987986  9.45224355 12.76029133 11.03984040
## [175] 14.17272760  8.19339249 13.26617401  9.80032287  8.79922387 12.95882662
## [181] 11.30231429 17.92255635  8.75833975 14.58790446 14.59199121  8.93614246
## [187] 11.30846495 11.15881454 11.24483041 10.78926913  8.08178568  9.83693916
## [193]  8.03855245  9.08324995  8.89660104  7.27389069  7.98752644 18.45544067
## [199] 13.65115520 -0.76768672 11.47811232  0.21543678 13.08569443  9.59466610
## [205] 11.03042595  7.46453687 17.29540982 10.82380163  8.56635482

Performance of linear regression model using RMSE (Root mean square error)

# RMSE = square root of the mean squared difference between actual and predicted values.
rmse = sqrt(mean((test_data$final.grade - test_data$final.grade.predicted)^2) )
rmse
## [1] 1.614811

Standard error rate of the model

# Standard error rate of model = rmse/mean(actual dependent value) * 100
std_error_rate_lm = rmse/mean(test_data$final.grade)*100
std_error_rate_lm
## [1] 13.9116

Observations:

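From the above, the RMSE on the test data is about 1.6 grade points, which corresponds to an error rate of about 14% relative to the mean final grade. This is reasonably good for grades that range from 0 to 20. The R-squared value of the linear regression model on the training data is about 85%.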
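
To judge whether an RMSE is good, it helps to compare it with a naive baseline. The sketch below uses hypothetical toy vectors rather than the report's test set, and the helper name rmse_of is an assumption. It computes the RMSE, the RMSE relative to the mean grade, and the RMSE obtained by always predicting the mean grade.

```r
# Hypothetical helper: root mean squared error between two numeric vectors
rmse_of <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values for illustration only
actual    <- c(10, 12, 14, 8)
predicted <- c(11, 12, 13, 10)

model_rmse    <- rmse_of(actual, predicted)
relative_rmse <- model_rmse / mean(actual) * 100   # percent of the mean grade

# Baseline: predict the mean grade for every student
baseline_rmse <- rmse_of(actual, rep(mean(actual), length(actual)))

c(model = model_rmse, relative_percent = relative_rmse, baseline = baseline_rmse)
```

A model is only useful if its RMSE is clearly below the baseline value.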

Q3. Analysis of grades and alcohol consumption between the two classes, i.e. the Math and Portuguese classes.

math_grades <- mat_students %>%
  gather(`first.period.grade`, `second.period.grade`, `final.grade`, key="semester", value="grade") %>%
  ggplot() +
  geom_bar(aes(x=grade, fill=semester), position="dodge") + 
  ggtitle("Distribution of three grades in Math")

por_grades <- por_students %>%
  gather(`first.period.grade`, `second.period.grade`, `final.grade`, key="semester", value="grade") %>%
  ggplot() +
  geom_bar(aes(x=grade, fill=semester), position="dodge") + 
  ggtitle("Distribution of three grades in Portuguese")

grid.arrange(math_grades,por_grades)

Average of final grades for students in Math class

mean(mat_students$final.grade)
## [1] 10.41519

Average of final grades for students in Portuguese class

mean(por_students$final.grade)
## [1] 11.90601

Observations:

From the above graph, we can see that there is not much difference between the grade distributions of the two classes. However, more students scored 0 in the final exam in the Math class than in the Portuguese class. Also, the average final grade is higher for the Portuguese class (11.91) than for the Math class (10.42).
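
Because the two classes differ in size, the number of zero grades is easier to compare as a share of each class. A minimal sketch on a hypothetical toy data frame (the values are invented for illustration and are not taken from the data set):

```r
# Toy data: one row per student, with the class and the final grade
toy <- data.frame(
  class       = c(rep("math", 5), rep("portuguese", 5)),
  final.grade = c(0, 10, 0, 12, 8, 11, 13, 0, 12, 14)
)

# Share of students with a final grade of 0, per class
zero_share <- tapply(toy$final.grade == 0, toy$class, mean)

# Mean final grade, per class
mean_grade <- tapply(toy$final.grade, toy$class, mean)

zero_share
mean_grade
```

The same two tapply() calls could be run on the combined data once a class column is added to mat_students and por_students before they are stacked.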

Distribution of students' grades by school

math_school_grades <- ggplot(mat_students) +
  geom_bar(aes(x=school, fill=as.factor(final.grade)), position="dodge") +
  ggtitle("Maths grades by school") +
  theme(legend.position = "none")

port_school_grades <- ggplot(por_students) +
  geom_bar(aes(x=school, fill=as.factor(final.grade)), position="dodge") +
  ggtitle("Portuguese grades by school") + 
  theme(legend.position = "none")

grid.arrange(math_school_grades, port_school_grades)

Observations:

Similar trends can be observed in both classes, but the average grades at the Gabriel Pereira (GP) school are higher than at the Mousinho da Silveira (MS) school.

Density of students' final grades by school

math_school_grades <- ggplot(mat_students, aes(x=final.grade)) +
  geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
  ggtitle("Maths students' grades by school")+
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
  scale_fill_manual(values = c("#868686FF", "#EFC000FF"))


port_school_grades <- ggplot(por_students, aes(x=final.grade)) +
  geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
  ggtitle("Portuguese students grades by school")+
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
  scale_fill_manual(values = c("#868686FF", "#EFC000FF"))

grid.arrange(math_school_grades, port_school_grades)

Observations:

A similar trend can be observed for the Math class in both schools, but students from the GP school outperform students from the MS school in the Portuguese class.

Observations:

Students living in urban areas tend to perform well as compared with the rural areas in both Math and Portuguese class.

Distribution of students' weekday alcohol consumption by school

mat_alcohol_level = ggplot(mat_students, aes(x=weekday.alcohol.consumption)) +
  geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
  ggtitle("Maths students' weekday alcohol consumption by school")+
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
  scale_fill_manual(values = c("#868686FF", "#EFC000FF"))

port_alcohol_level = ggplot(por_students, aes(x=weekday.alcohol.consumption)) +
  geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
  ggtitle("Portuguese students' weekday alcohol consumption by school")+
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
  scale_fill_manual(values = c("#868686FF", "#EFC000FF"))

grid.arrange(mat_alcohol_level,port_alcohol_level)

Alcohol consumption level of Maths and Portuguese students

# Calculate total alcohol consumption for Math and Portuguese class
mat_students$total.alcohol.consumption = rowSums(cbind(mat_students$weekday.alcohol.consumption, mat_students$weekend.alcohol.consumption))

por_students$total.alcohol.consumption = rowSums(cbind(por_students$weekday.alcohol.consumption, por_students$weekend.alcohol.consumption))

Weekday and weekend alcohol consumption levels for Math class

# x axis: reported consumption level; y axis: number of students at that level
hist(mat_students$weekday.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekdays by Math class",xlab="Weekday alcohol consumption level", border="blue", col="purple",ylab="Number of students")

hist(mat_students$weekend.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekends by Math class",xlab="Weekend alcohol consumption level", border="blue", col="purple",ylab="Number of students")

Weekday and weekend alcohol consumption levels for Portuguese class

hist(por_students$weekday.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekdays by Portuguese class",xlab="Weekday alcohol consumption level", border="blue", col="green",ylab="Number of students")

hist(por_students$weekend.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekends by Portuguese class",xlab="Weekend alcohol consumption level", border="blue", col="green",ylab="Number of students")

Mean weekday alcohol consumption level of Math students

mean(mat_students$weekday.alcohol.consumption)
## [1] 1.481013

Mean weekend alcohol consumption level of Math students

mean(mat_students$weekend.alcohol.consumption)
## [1] 2.291139

Mean weekday alcohol consumption level of Portuguese students

mean(por_students$weekday.alcohol.consumption)
## [1] 1.502311

Mean weekend alcohol consumption level of Portuguese students

mean(por_students$weekend.alcohol.consumption)
## [1] 2.280431

Alcohol consumption levels are similar for the two classes, both on weekdays and on weekends. In both classes the mean level is higher on weekends (about 2.3) than on weekdays (about 1.5).
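
Since consumption is recorded as a level from 1 to 5, the mean hides how students are spread across the levels. A minimal sketch, on hypothetical toy vectors and with an assumed helper name level_share, of comparing the share of students at each level between two classes of different size:

```r
# Toy consumption levels for two classes of different size (invented values)
math_weekday <- c(1, 1, 2, 1, 3, 1, 2, 5)
port_weekday <- c(1, 2, 1, 1, 1, 4, 2, 1, 1, 3)

# Hypothetical helper: share of students at each level 1..5,
# keeping levels that nobody reported as 0 instead of dropping them
level_share <- function(x) {
  prop.table(table(factor(x, levels = 1:5)))
}

shares <- rbind(
  math       = level_share(math_weekday),
  portuguese = level_share(port_weekday)
)

round(shares, 2)
```

Each row sums to 1, so the rows can be compared directly even though the classes have different numbers of students.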

How do the grades of both classes relate to the quality of family relationships?

math_class_family <- ggplot(mat_students, aes(x=final.grade)) +
  geom_density(aes(color=as.factor(quality.of.family.relationships))) +
  ggtitle("Maths students grades by family relationships")

port_class_family <- ggplot(por_students, aes(x=final.grade)) +
  geom_density(aes(color=as.factor(quality.of.family.relationships))) +
  ggtitle("Portuguese students grades by family relationships")

grid.arrange(math_class_family, port_class_family)

Observations:

Surprisingly, students with the poorest family relationships had a higher overall Math score than students with stronger relationships. In the Portuguese class, on the other hand, the pattern is the exact opposite.
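
Density curves drawn from small groups can be misleading, so it is worth checking how many students fall into each family-relationship level before reading too much into this comparison. A minimal sketch on a hypothetical toy data frame (invented values, using the same column names as the report):

```r
# Toy data: family relationship level and final grade per student
toy <- data.frame(
  quality.of.family.relationships = c(1, 1, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5),
  final.grade                     = c(14, 12, 10, 11, 9, 12, 10, 13, 11, 12, 14, 10)
)

# Number of students and mean final grade at each level
group_n    <- table(toy$quality.of.family.relationships)
group_mean <- tapply(toy$final.grade, toy$quality.of.family.relationships, mean)

summary_tbl <- data.frame(
  level = names(group_mean),
  n     = as.vector(group_n),
  mean  = as.vector(group_mean)
)
summary_tbl
```

A level with only a handful of students, such as level 1 in this toy example, can show a high mean purely by chance.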

Conclusion:

Based on the data analysis, we have concluded that weekday student alcohol consumption is correlated with the amount of free time a student has, the amount of time they study, how often they go out with their friends, the quality of their family relationships, and weekend alcohol consumption. This information can be useful for the secondary schools: they can create after-school programs and provide counseling services to try to reduce the amount of free time that can lead to alcohol consumption. Additionally, since weekday and weekend consumption are significantly associated with one another, parents can focus on ways to reduce weekday consumption by providing better guidance and better activities for students during their free time.

Next, based on our data analysis of student performance, we can conclude that past class failures, school absences, first period grade, second period grade, extracurricular activities and educational support from family are statistically significant predictors of the students' final grades. This will allow the schools to focus their attention on the material that students need the most help with so that they can achieve better grades.

Lastly, after comparing performance in the two classes, our data analysis shows that students in the Portuguese language class on average obtained better grades than students in the Math class. However, both classes reported about the same level of alcohol consumption during the week, and both classes showed an increase over the weekend.

In conclusion, school administrators can use this information to create more opportunities for after-school activities and to provide better tutoring services to students who fall behind at the start of the period.